library(tidyverse)
library(latex2exp)
library(gridExtra)
library(wesanderson)
theme_set(theme_minimal())
Distinguish between (1) residual and semistudentized residual, (2) \(\text{E}[\varepsilon_i] = 0\) and \(\overline{e} = 0\) and (3) error term and residual.
Answer: The semistudentized residual is the residual divided by \(\sqrt{\text{MSE}}\), an approximation of the standard deviation of the residual itself. \(\overline{e} = 0\) states that the mean of the residuals computed from the fitted simple linear regression model is zero, an algebraic consequence of least squares, while \(\text{E}[\varepsilon_i] = 0\) states that the true errors have an expected value of \(0\), a model assumption. The difference between the error term and the residual is that the residual is the observed difference between the observed value and the fitted value, while the error term is the unobservable true error in the regression model.
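As a small sketch of the first distinction (on made-up data, using base R's `lm()` rather than the manual fit used later), the semistudentized residual divides each residual by \(\sqrt{\text{MSE}}\):

```r
# Semistudentized residuals: e* = e / sqrt(MSE), on made-up data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
e <- resid(fit)                      # residuals e_i = y_i - yhat_i
mse <- sum(e^2) / (length(y) - 2)    # MSE on n - 2 degrees of freedom
e_star <- e / sqrt(mse)              # semistudentized residuals
mean(e)                              # essentially 0: least squares forces sum(e) = 0
```

Note that \(\overline{e} = 0\) holds by construction for any least squares fit with an intercept, whereas \(\text{E}[\varepsilon_i] = 0\) is an assumption about the unobservable errors.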
Prepare a prototype residual plot for each of the following cases: (1) error variance decreases with \(X\); (2) true regression function is \(\bigcup\) shaped, but a linear regression function is fitted.
Answer:
lm.fit_manual = function(X, Y){
  b1 = sum((X - mean(X))*(Y - mean(Y))) / sum((X - mean(X))^2)
  b0 = mean(Y) - b1*mean(X)
  return(c(b0, b1))
}
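As a quick sanity check on synthetic data (not part of the original solution), the closed-form estimates above agree with base R's `lm()`; equivalently, \(b_1 = \text{cov}(X, Y)/\text{var}(X)\), since the \(n - 1\) factors cancel:

```r
# Cross-check of the closed-form slope/intercept against lm(), on synthetic data
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
b1 <- cov(x, y) / var(x)     # same as sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
b0 <- mean(y) - b1 * mean(x)
all.equal(unname(coef(lm(y ~ x))), c(b0, b1))   # TRUE
```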
In this scenario, the error variance decreases with \(X\):
set.seed(2020)
x = sample(1:100, 100, replace = TRUE)
error = rnorm(mean = x, sd = 100-x, n = 100)
y = rexp(x) + error
model = lm.fit_manual(x, y)
pred = model[1] + model[2]*x
resid = y - pred
df = data.frame(x, y, pred, resid)
plot1 = df %>% ggplot(aes(x=x, y=y)) +
geom_point(size = .8) +
geom_abline(intercept = model[1], slope = model[2], color = "firebrick") +
labs(x = "x", y = "y", title = "Scatter Plot",
subtitle = "where error variance decreases with x")
plot2 = df %>% ggplot(aes(x=x, y=resid)) +
geom_point(size = .8) +
geom_hline(yintercept = 0, color = "firebrick") +
labs(x = "x", y = "residual",
title = "Residual Plot",
subtitle = "where error variance decreases with x")
grid.arrange(plot1, plot2, ncol = 2)
In this scenario, the true regression function is \(\bigcup\) shaped but a linear regression function is fitted.
y = x^2
model = lm.fit_manual(x, y)
pred = model[1] + model[2]*x
resid = y - pred
df = data.frame(x, y, pred, resid)
plot1 = df %>% ggplot(aes(x=x, y=y)) +
geom_point(size = .8) +
geom_abline(intercept = model[1], slope = model[2], color = "firebrick") +
labs(x = "x", y = "y", title = "Scatter Plot",
subtitle = "where there is a misspecification")
plot2 = df %>% ggplot(aes(x=x, y=resid)) +
geom_point(size = .8) +
geom_hline(yintercept = 0, color = "firebrick") +
labs(x = "x", y = "residual",
title = "Residual Plot",
subtitle = "where there is a misspecification")
grid.arrange(plot1, plot2, ncol = 2)
Refer to Grade point average Problem 1.19.
Answer: The box plot is plotted below.
gpa = read.csv('CH01PR19.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
gpa %>%
ggplot(aes(x = '', y = x)) +
geom_boxplot(fill = "deepskyblue4") +
labs(x = '', y = 'ACT Scores', title = "Box Plot of ACT Scores") +
coord_flip()
The median ACT score is 25, while the first and third quartiles are approximately 21 and 28. There are no outliers.
Answer: The dot plot of the residuals is plotted below.
model = lm.fit_manual(gpa$x, gpa$y)
pred = model[1] + model[2]*gpa$x
resid = gpa$y - pred
gpa_resids = data.frame(x = gpa$x, y = gpa$y, pred, resid)
gpa_resids %>% ggplot(aes(x = resid)) +
geom_dotplot(binwidth = .1, fill = "darkseagreen", color = "black") +
scale_y_continuous(breaks = NULL, name = '') +
labs(x = "Residuals",
title = "Dot Plot of the Residuals", subtitle = "of Regressing ACT Scores on GPA")
The residuals are skewed to the left and centered around \(0\). There are two large negative residuals.
Answer: The residuals are plotted against the fitted values below.
gpa_resids %>% ggplot(aes(x = pred, y = resid)) +
geom_point() +
labs(x = "Yhat", y = "e_i",
title = "Residual Plot", subtitle = "of Regressing ACT Scores on GPA")
The residuals are distributed randomly. A linear relationship between GPA and ACT scores is probable since there is no pattern in the distribution of the residuals. There appears to be a constant variance.
Answer: The normal probability plot is plotted below.
prob_plot_table = function(df){
  model = lm.fit_manual(df$x, df$y)
  pred = model[1] + model[2]*df$x
  resid = df$y - pred
  n = nrow(df)
  resids_df = data.frame(x = df$x, y = df$y, pred, resid)
  mse_sq_root = sqrt(sum((resids_df$y - resids_df$pred)^2) / (n - 2))
  resids_df = resids_df %>% mutate("Run" = 1:nrow(resids_df),
                                   "k" = rank(resid)) %>%
    mutate("exp_val" = mse_sq_root * qnorm((k - .375)/(n + .25)))
  return(resids_df)
}
prob_plot = function(df, subtitle){
  prob_plot_table(df) %>%
    ggplot(aes(x = resid, y = exp_val)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1) +
    labs(x = "Residuals", y = "Expected Value of Residuals Under Normality",
         title = "Normal Probability Plot",
         subtitle = subtitle)
}
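As an aside (not part of the original solution), the plotting positions \((k - .375)/(n + .25)\) used above are the Blom-type positions that base R's `ppoints()` uses for \(n \le 10\); for larger \(n\), `ppoints()` switches to \((k - .5)/n\):

```r
# The plotting positions above match base R's ppoints() for n <= 10
n <- 10
k <- 1:n
all.equal(ppoints(n), (k - 0.375) / (n + 0.25))   # TRUE
```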
The normal probability plot of the residuals in estimating GPA using ACT scores is shown below.
prob_plot(gpa, "GPA Scores")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(gpa) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9737275
## exp_val 0.9737275 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .05\) and \(n=120\), is \(.9889\). Since the calculated correlation coefficient is less than the critical value, there is evidence that the error terms are not reasonably normally distributed. This was also evident in the plot above, where the points do not line up.
Answer:
brown.forsythe = function(df, split, alpha){
  df = prob_plot_table(df)
  n = nrow(df)
  group1 = df[df$x < split,]
  n1 = nrow(group1)
  e1_median = median(group1$resid)
  d1 = abs(group1$resid - e1_median)
  d1_bar = mean(d1)
  group2 = df[df$x >= split,]
  n2 = nrow(group2)
  e2_median = median(group2$resid)
  d2 = abs(group2$resid - e2_median)
  d2_bar = mean(d2)
  s = sqrt((sum((d1 - d1_bar)^2) + sum((d2 - d2_bar)^2))/(n-2))
  t_ast = abs(round((d1_bar - d2_bar) / (s * sqrt((1/n1) + (1/n2))), 3))
  if(t_ast < qt(1 - (alpha/2), n-2)){
    paste("At the alpha level of", alpha, "the test statistic is", t_ast, "and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant.", sep = ' ')
  }
  else{
    paste("At the alpha level of", alpha, "the test statistic is", t_ast, "and the null hypothesis is rejected. There is evidence that the error variance is nonconstant.", sep = ' ')
  }
}
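The statistic computed above is just a pooled two-sample \(t\) statistic applied to the absolute deviations from each group's median residual, so it can be cross-checked against `t.test()` (shown here on made-up residuals, not the textbook data):

```r
# Brown-Forsythe as a pooled two-sample t-test on absolute deviations from medians
resid1 <- c(-2.0, -0.5, 0.3, 1.1, 2.4)   # made-up group 1 residuals
resid2 <- c(-5.2, -1.0, 0.8, 2.6, 4.9)   # made-up group 2 residuals
d1 <- abs(resid1 - median(resid1))
d2 <- abs(resid2 - median(resid2))
# |t| from t.test matches the t* formula used in brown.forsythe()
abs(unname(t.test(d1, d2, var.equal = TRUE)$statistic))
```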
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
brown.forsythe(gpa, 26, 0.01)
## [1] "At the alpha level of 0.01 the test statistic is 0.897 and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant."
Answer:
gpa_fulldata = read.csv("CH03PR03.txt", sep = '', header = FALSE,
col.names = c("y", "x1", "x2", "x3"),
colClasses = rep("numeric", 4))
gpa_fulldata_resids = merge(prob_plot_table(gpa), gpa_fulldata)
plot_1 = gpa_fulldata_resids %>% ggplot(aes(x = x2, y = resid)) + geom_point() +
labs(x = "Intelligence Test Score", y = "Residuals",
title = "Residuals against Intelligence Test Score")
plot_2 = gpa_fulldata_resids %>% ggplot(aes(x = x3, y = resid)) + geom_point() +
labs(x = "High School Class Rank Percentile", y = "Residuals",
title = "Residuals against HS Class Rank Percentile")
grid.arrange(plot_1, plot_2, nrow = 2)
The residuals increase as intelligence test scores increase and stay relatively constant with increasing high school class rank percentile.
Refer to Copier maintenance Problem 1.20.
Answer:
The dot plot for the number of copiers serviced is shown below.
copier = read.csv('CH01PR20.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
copier %>%
ggplot(aes(x = x)) + geom_dotplot(fill = "darkseagreen", color = "black") +
scale_x_continuous(breaks = seq(0, 10, by = 2)) +
scale_y_continuous(breaks = NULL, name = '') +
labs(x = "Number of Copiers", y = '', title = "Dot Plot of the Number of Copiers Serviced")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
The number of copiers serviced appears to be distributed slightly right skewed. There are no outlying cases with respect to this variable.
Answer:
The time plot for the number of copiers serviced is shown below.
copier %>% mutate(time = 1:nrow(.)) %>%
ggplot(aes(x=time, y = x)) +
geom_point() +
geom_line(color = "deepskyblue2") +
labs(x = "time", y = "Number of Copiers",
title = "Number of Copiers Serviced", subtitle = "in time order")
As time increases, there does not appear to be any overall trend in the number of copiers being serviced. It is noted, however, that the number of copiers serviced rises and falls roughly every three to four time points.
Answer:
The stem-and-leaf plot of the residuals is shown below.
model = lm.fit_manual(copier$x, copier$y)
pred = model[1] + model[2]*copier$x
resid = copier$y - pred
df = data.frame(copier$x, copier$y, pred, resid)
stem(df$resid)
##
## The decimal point is 1 digit(s) to the right of the |
##
## -2 | 30
## -1 |
## -1 | 3110
## -0 | 99997
## -0 | 44333222111
## 0 | 001123334
## 0 | 5666779
## 1 | 112234
## 1 | 5
The residuals hover around 0 and appear approximately normally distributed.
Answer:
The residual plots of \(e_i\) versus \(\hat{Y}_i\) and \(e_i\) versus \(X_i\) is shown below.
plot_1 = df %>% ggplot(aes(x = pred, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "prediction", y = "residual",
title = "Predictions vs Residuals",
subtitle = "of Number of Copiers Serviced and Service Time")
plot_2 = df %>% ggplot(aes(x = copier.x, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "number of copiers", y = "residual",
title = "Number of Copiers Serviced vs Residual")
grid.arrange(plot_1, plot_2, nrow = 2)
These two plots provide the same information. The nonlinearity of the regression function can be studied from these plots. It is found that there is a reasonably random distribution of points, indicating that there is a linear relationship between the number of copiers serviced and service time.
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(copier, "of the Residuals of Predicting Service Time by Number of Copiers Serviced")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(copier) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9892079
## exp_val 0.9892079 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .10\) and \(n=45\), is \(.981\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appear to line up.
Answer:
The time plot of the residuals is shown below.
df %>% mutate(time = 1:nrow(.)) %>%
ggplot(aes(x = time, y = resid)) +
geom_point() +
geom_line(color = "deepskyblue2") +
labs(x = "time", y = "Residuals",
title = "Residuals in Predictions", subtitle = "in time order")
There does not appear to be any correlation in error terms over time.
Answer:
breusch.pagan = function(df, alpha){
  model = lm.fit_manual(df$x, df$y)
  pred = model[1] + model[2]*df$x
  resid = df$y - pred
  df_new = data.frame(x = df$x, y = df$y, pred = pred, resid = resid, resid_sq = resid^2)
  sse = sum(resid^2)
  model_sq_error = lm.fit_manual(df_new$x, df_new$resid_sq)
  pred_sq_error = model_sq_error[1] + model_sq_error[2]*df_new$x
  ssr_ast = sum((pred_sq_error - mean(df_new$resid_sq))^2)
  chi_sq_BP = round((ssr_ast/2) / ((sse/nrow(df))^2), 3)
  if(chi_sq_BP < qchisq(1 - alpha, 1)){
    paste("At the alpha level of", alpha, "the test statistic is", chi_sq_BP, "and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant.", sep = ' ')
  }
  else{
    paste("At the alpha level of", alpha, "the test statistic is", chi_sq_BP, "and the null hypothesis is rejected. There is evidence that the error variance is nonconstant.", sep = ' ')
  }
}
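If the `lmtest` package is available (an assumption; it is not loaded above), its `bptest()` with `studentized = FALSE` computes the same non-studentized Breusch-Pagan statistic \((\text{SSR}^*/2)/(\text{SSE}/n)^2\), which gives an independent check, shown here on synthetic data:

```r
# Hedged cross-check against lmtest::bptest (assumes lmtest is installed);
# studentized = FALSE requests the original, normality-based BP statistic
library(lmtest)
set.seed(42)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 1 + 0.05 * x)   # variance grows with x
bptest(lm(y ~ x), studentized = FALSE)
```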
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
breusch.pagan(copier, alpha = .05)
## [1] "At the alpha level of 0.05 the test statistic is 1.315 and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant."
Answer:
The residuals against \(X_2\) and \(X_3\) are shown below.
copier_fulldata = read.csv('CH03PR04.txt', sep = '', header = FALSE,
col.names = c('y', 'x1', 'x2', 'x3'),
colClasses = rep('numeric', 4))
copier_fulldata_resids = merge(prob_plot_table(copier), copier_fulldata)
plot_1 = copier_fulldata_resids %>% ggplot(aes(x = x2, y = resid)) + geom_point() +
labs(x = "Mean Operational Age of Copiers, in Months", y = "Residuals",
title = "Residuals against Mean Operational Age of Copiers")
plot_2 = copier_fulldata_resids %>% ggplot(aes(x = x3, y = resid)) + geom_point() +
labs(x = "Years of Experience of the Service Person", y = "Residuals",
title = "Residuals against Years of Experience of the Service Person")
grid.arrange(plot_1, plot_2, nrow = 2)
Residuals increase as the mean operational age of copiers, in months, increases, indicating that the model does not predict well for older copiers. There does not seem to be any pattern in model residuals and years of experience of the service person making the call.
Refer to Airfreight breakage Problem 1.21.
Answer:
The dot plot for the number of transfers is shown below.
airfreight = read.csv('CH01PR21.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
airfreight %>%
ggplot(aes(x = x)) + geom_dotplot(fill = "darkseagreen", color = "black") +
scale_y_continuous(breaks = NULL, name = '') +
labs(x = "Number of Transfers", y = '', title = "Dot Plot of the Number of Transfers")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the number of transfers appears to be asymmetrical: as the number of transfers increases, the number of such cases decreases.
Answer:
The time plot of the number of transfers is shown below.
airfreight %>% mutate(time = 1:nrow(.)) %>%
ggplot(aes(x = time, y = x)) +
geom_point() +
geom_line(color = "deepskyblue2") +
scale_x_continuous(breaks = 1:10) +
labs(x = "time", y = "Number of Transfers",
title = "Number of Transfers", subtitle = "in time order")
There does not appear to be any systematic evidence of any trend in the number of transfers over time.
Answer:
The stem-and-leaf plot of the residuals is shown below.
model = lm.fit_manual(airfreight$x, airfreight$y)
pred = model[1] + model[2]*airfreight$x
resid = airfreight$y - pred
df = data.frame(x = airfreight$x, y = airfreight$y, pred, resid)
stem(df$resid)
##
## The decimal point is at the |
##
## -2 | 2
## -1 | 222
## -0 | 2
## 0 | 888
## 1 | 88
The residuals take only a few distinct values, with two pronounced peaks at about \(-1.2\) and \(0.8\).
Answer:
The plots of the residuals \(e_i\) against \(X_i\) is shown below.
df %>% ggplot(aes(x = x, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "number of transfers", y = "residual",
title = "Number of Transfers vs Residual")
As the number of transfers increases, the residuals become smaller and smaller. This is a sign of nonconstant variance of the error terms.
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(df, "of the Residuals of Predicting Breakage using Number of Transfers")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(airfreight) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9913394
## exp_val 0.9913394 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .01\) and \(n=10\), is \(.879\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appear to line up.
Answer:
The time plot of the residuals is shown below.
df %>% mutate(time = 1:nrow(.)) %>%
ggplot(aes(x = time, y = resid)) +
geom_point() +
geom_line(color = "deepskyblue2") +
labs(x = "time", y = "Residuals",
title = "Residuals in Predictions", subtitle = "in time order")
There is no pattern in the residuals as time passes, meaning there is no correlation in the error terms over time.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
breusch.pagan(airfreight, alpha = .1)
## [1] "At the alpha level of 0.1 the test statistic is 1.033 and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant."
This conclusion is supported by the preliminary findings in part (d).
Refer to Plastic hardness Problem 1.22.
Answer:
The boxplot of the residuals is shown below.
plastic = read.csv('CH01PR22.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
plastic_model = prob_plot_table(plastic)
plastic_model %>%
ggplot(aes(x = '', y = resid)) +
geom_boxplot(fill = "deepskyblue4") +
labs(x = '', y = 'Residual',
title = "Box Plot of the Residuals of Predicting Hardness of Plastic using Time Elapsed") +
coord_flip()
The residuals appear to be symmetrically distributed, with a median slightly above 0.
Answer:
The residuals \(e_i\) are plotted against the fitted values \(\hat{Y}_i\) below.
plastic_model %>%
ggplot(aes(x = pred, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "prediction", y = "residual",
title = "Predictions vs Residuals",
subtitle = "of Time Elapsed and Hardness of Plastic")
The residuals appear scattered but show a slight downward trend. This could mean the variance of the error term is nonconstant.
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(plastic, "of the Residuals between Time Elapsed and Hardness of Plastic")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(plastic) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9916733
## exp_val 0.9916733 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .05\) and \(n=16\), is \(.941\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appear to line up.
Answer:
expected_freq = function(df){
  model = lm.fit_manual(df$x, df$y)
  pred = model[1] + model[2]*df$x
  resid = df$y - pred
  n = nrow(df)
  resids_df = data.frame(x = df$x, y = df$y, pred, resid)
  mse_sq_root = sqrt(sum((resids_df$y - resids_df$pred)^2) / (n - 2))
  resids_df = resids_df %>% mutate("Run" = 1:nrow(resids_df),
                                   "k" = rank(resid)) %>%
    mutate("exp_val_25" = mse_sq_root * qt((k - .375)/(n + .25), .25),
           "exp_val_50" = mse_sq_root * qt((k - .375)/(n + .25), .5),
           "exp_val_75" = mse_sq_root * qt((k - .375)/(n + .25), .75))
  return(resids_df)
}
expected_freq_table = expected_freq(plastic)
plot_1 = expected_freq_table %>%
ggplot(aes(x = resid, y = exp_val_25)) +
geom_point() +
geom_text(aes(x = 0, y = -20000, label = "25%")) +
labs(x = '',
y = "Expected Value of Residuals Under Normality",
title = "Normal Probability Plot using Various Expected Frequencies under Normality")
plot_2 = expected_freq_table %>%
ggplot(aes(x = resid, y = exp_val_50)) +
geom_point() +
geom_text(aes(x = 0, y = -200, label = "50%")) +
labs(x = "Residuals", y = '', title = '')
plot_3 = expected_freq_table %>%
ggplot(aes(x = resid, y = exp_val_75)) +
geom_point() +
geom_text(aes(x = 0, y = -45, label = "75%")) +
labs(x = '', y = '', title = '')
grid.arrange(plot_1, plot_2, plot_3, nrow = 1)
The information provided by these comparisons is consistent with the findings from the normal probability plot in part (c). As the percentile of the relevant \(t\) distribution increases, the points align more closely along the line \(y = x\) and the scale of the expected values of the residuals becomes smaller.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
brown.forsythe(plastic, 25, 0.05)
## [1] "At the alpha level of 0.05 the test statistic is 0.856 and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant."
Refer to Muscle mass Problem 1.27.
Answer:
A stem-and-leaf plot of the ages is shown below.
muscle = read.csv('CH01PR27.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
stem(muscle$x)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 4 | 11122334
## 4 | 5677788
## 5 | 123344
## 5 | 56777999
## 6 | 00013334
## 6 | 5556889
## 7 | 001223
## 7 | 5666788888
The ages appear to be spread randomly within each group of women. Overall, the distribution appears roughly uniform, with no skewness at either end and no indication of normality.
Answer:
The dot plot of the residuals is shown below.
muscle_model = prob_plot_table(muscle)
muscle_model %>%
ggplot(aes(x = resid)) +
geom_dotplot(binwidth = 2, fill = "darkseagreen", color = "black") +
scale_y_continuous(breaks = NULL, name = '') +
labs(x = "Residuals",
title = "Dot Plot of the Residuals", subtitle = "of Regressing Muscle Mass on Age")
The residuals are skewed to the right due to the presence of an outlier.
Answer:
The plot of the residual \(e_i\) against \(\hat{Y}_i\) and also against \(X_i\) is shown below.
plot_1 = muscle_model %>% ggplot(aes(x = pred, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "prediction", y = "residual",
title = "Predictions vs Residuals",
subtitle = "of Age on Muscle Mass ")
plot_2 = muscle_model %>% ggplot(aes(x = x, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "age", y = "residual",
title = "Age vs Residual")
grid.arrange(plot_1, plot_2, nrow = 2)
These two plots show the same information, mirrored, because of the negative trend between muscle mass and age. Nonetheless, since the residuals appear scattered and do not follow a particular pattern, the error terms should have constant variance. The residuals are slightly larger at one end of the range than the other.
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(muscle, "of the Residuals between Muscle Mass and Age")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(muscle) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9897499
## exp_val 0.9897499 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .10\) and \(n=60\), is \(.984\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appear to line up.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
breusch.pagan(muscle, alpha = .01)
## [1] "At the alpha level of 0.01 the test statistic is 3.817 and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant."
Refer to Crime rate Problem 1.28.
Answer:
The stem-and-leaf plot of the percentage of individuals in the county having at least a high school diploma is shown below.
crime = read.csv('CH01PR28.txt', sep = '', header = FALSE,
col.names = c('y', 'x'),
colClasses = c('numeric', 'numeric'))
stem(crime$x)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 6 | 1444
## 6 | 5678
## 7 | 00334444
## 7 | 5555666677777778888888999999
## 8 | 000011111112222222233333344444
## 8 | 55578889
## 9 | 11
There is a much higher number of percentages in the 70s and 80s than at either end. The percentages appear to be approximately normally distributed.
Answer:
The box plot of the residuals is shown below.
crime_model = prob_plot_table(crime)
crime_model %>%
ggplot(aes(x = '', y = resid)) +
geom_boxplot(fill = "deepskyblue4") +
labs(x = '', y = 'Residual',
title = "Box Plot of the Residuals of Predicting Crime Rate",
subtitle = "using % of People with High School Diplomas") +
coord_flip()
The distribution of the residuals does not appear to be symmetrical.
Answer:
The residual plot of \(e_i\) against \(\hat{Y}_i\) is shown below.
crime_model %>%
ggplot(aes(x = pred, y = resid)) +
geom_point(color = "cadetblue") +
labs(x = "prediction", y = "residual",
title = "Predictions vs Residuals",
subtitle = "of Crime Rate")
The residuals appear to be more varied in the middle range of the predictions, whereas at either end they congregate toward \(0\).
Answer:
The normal probability plot of the residuals is shown below.
prob_plot(crime, "of the Residuals in Predicting Crime Rate using % of HSD")
The coefficient of correlation between the ordered residuals and their expected values under normality is
prob_plot_table(crime) %>% select(resid, exp_val) %>% cor()
## resid exp_val
## resid 1.0000000 0.9887589
## exp_val 0.9887589 1.0000000
There is a high correlation between the ordered residuals and their expected values under normality. Using Table B.6, the critical value for the coefficient of correlation between ordered residuals and expected values under normality, at \(\alpha = .05\) and \(n=84\), is \(.985\). Since the calculated correlation coefficient is greater than the critical value, there is evidence that the error terms are reasonably normally distributed. This was also evident in the plot above, where the points appear to line up.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
brown.forsythe(crime, 69, 0.05)
## [1] "At the alpha level of 0.05 the test statistic is 0.355 and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant."
Electricity consumption. An economist studying the relation between household electricity (\(Y\)) and number of rooms in the home (\(X\)) employed linear regression model (2.1) and obtained the residuals. Plot the residuals against \(X_i\). What problem appears to be present here? Might a transformation alleviate the problem?
Answer:
A plot of the residuals against \(X_i\) is shown below.
electricity = read.csv('CH03PR09.txt', sep = '', header = FALSE,
col.names = c('x', 'e'),
colClasses = c('numeric', 'numeric'))
electricity %>%
ggplot(aes(x = x, y = e)) +
geom_point(color = "cadetblue") +
labs(x = "number of rooms", y = "residual",
title = "Number of Rooms vs Residual")
The residuals grow and shrink in magnitude with no detectable pattern. A transformation might alleviate this problem and provide a viable way to correct the nonlinearity of the regression function.
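As an illustration on simulated data (not the CH03PR09 data), a log transformation of \(Y\) can remove this kind of residual pattern when the underlying structure is multiplicative:

```r
# Simulated illustration: residual pattern before and after transforming Y
set.seed(7)
x <- runif(50, 1, 10)
y <- exp(0.5 * x + rnorm(50, sd = 0.2))            # multiplicative errors
par(mfrow = c(1, 2))
plot(x, resid(lm(y ~ x)), ylab = "residual", main = "Raw Y")        # strong pattern
plot(x, resid(lm(log(y) ~ x)), ylab = "residual", main = "log(Y)")  # roughly patternless
```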
Per capita earnings. A sociologist employed linear regression model (2.1) to relate per capita earnings (\(Y\)) to average number of years of schooling (\(X\)) for \(12\) cities. The fitted values \(\hat{Y}_i\) and the semistudentized residuals \(e_i^*\) are given.
Answer:
A plot of the semistudentized residuals against the fitted values is shown below.
earnings = read.csv('CH03PR10.txt', sep = '', header = FALSE,
col.names = c('y', 'e'),
colClasses = c('numeric', 'numeric'))
earnings %>%
ggplot(aes(x = y, y = e)) +
geom_point() +
labs(x = "Predicted Per Capita Earning", y = "Semistudentized Residual",
title = "Predicted Per Capita Earning vs Semistudentized Residual")
There is no pattern between the predicted per capita earnings and the semistudentized residuals. However, there appears to be one outlier whose residual is much lower than the rest.
Answer:
The above plot is redrawn with two lines added indicating the area above and below \(1\) standard deviation.
earnings %>%
ggplot(aes(x = y, y = e)) +
geom_point() +
geom_hline(yintercept = -1,
linetype = "dashed", color = "magenta4") +
geom_hline(yintercept = 1,
linetype = "dashed", color = "magenta4") +
labs(x = "Predicted Per Capita Earning", y = "Semistudentized Residual",
title = "Predicted Per Capita Earning vs Semistudentized Residual")
There appear to be \(4\) semistudentized residuals outside \(\pm 1\) standard deviation. This is consistent with the normal error model: about \(68\%\) of the values should fall within \(1\) standard deviation of the mean, and with \(n = 12\) that is roughly \(8.16\) values inside, leaving the remaining \(3.84\), or about \(4\), outside.
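The arithmetic can be checked directly from the standard normal distribution (using the exact probability \(0.6827\) rather than the rounded \(68\%\)):

```r
# Expected number of semistudentized residuals outside +/- 1 sd for n = 12
n <- 12
inside <- pnorm(1) - pnorm(-1)       # about 0.683
outside <- n * (1 - inside)
round(c(n * inside, outside), 2)     # about 8.19 inside, 3.81 outside
```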
Drug concentration. A pharmacologist employed linear regression model (2.1) to study the relation between the concentration of a drug in plasma (\(Y\)) and the log-dose of the drug (\(X\)). The residuals and log-dose levels are given.
Answer:
A plot of the residuals against the \(X_i\) values is shown below.
drug = read.csv('CH03PR11.txt', sep = '', header = FALSE,
col.names = c('x', 'e'),
colClasses = c('numeric', 'numeric'))
drug %>%
ggplot(aes(x = x, y = e)) +
geom_point() +
labs(x = "Log Dose of the Drug", y = "Residual",
title = "Log Dose of the Drug vs Residual")
As the log dose of the drug increases, the residual grows in magnitude. This shows that the error variance is not constant.
Answer:
Let the null hypothesis be that the error variance is constant and the alternative hypothesis that the error variance is not constant. Then
breusch.pagan.error = function(df, alpha){
  df_new = data.frame(x = df$x, resid = df$e, resid_sq = (df$e)^2)
  sse = sum(df$e^2)
  model_sq_error = lm.fit_manual(df_new$x, df_new$resid_sq)
  pred_sq_error = model_sq_error[1] + model_sq_error[2]*df_new$x
  ssr_ast = sum((pred_sq_error - mean(df_new$resid_sq))^2)
  chi_sq_BP = round((ssr_ast/2) / ((sse/nrow(df))^2), 3)
  if(chi_sq_BP < qchisq(1 - alpha, 1)){
    paste("At the alpha level of", alpha, "the test statistic is", chi_sq_BP, "and we fail to reject the null hypothesis. There is no sufficient evidence that the error variance is nonconstant.", sep = ' ')
  }
  else{
    paste("At the alpha level of", alpha, "the test statistic is", chi_sq_BP, "and the null hypothesis is rejected. There is evidence that the error variance is nonconstant.", sep = ' ')
  }
}
breusch.pagan.error(drug, .05)
## [1] "At the alpha level of 0.05 the test statistic is 3.718 and the null hypothesis is failed to be rejected. There is no sufficent evidence that the error variance is nonconstant."
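As a cross-check on the manual function (a sketch, not part of the original solution), the same statistic can be computed with `lm()` for the auxiliary regression of \(e^2\) on \(X\); the data below are made up for illustration and are not Problem 3.11's:

```r
# Breusch-Pagan statistic via lm() for the auxiliary regression of e^2 on x.
bp_stat <- function(x, e) {
  sse <- sum(e^2)                               # SSE from the fitted line
  aux <- lm(I(e^2) ~ x)                         # auxiliary regression of e^2 on x
  ssr_ast <- sum((fitted(aux) - mean(e^2))^2)   # SSR* of the auxiliary regression
  (ssr_ast / 2) / (sse / length(x))^2           # X^2_BP = (SSR*/2) / (SSE/n)^2
}
x <- c(1, 2, 3, 4)
e <- c(1, -1, 2, -2)                   # residual magnitude grows with x
bp_stat(x, e)                          # 0.576
bp_stat(x, e) > qchisq(0.95, df = 1)   # FALSE: cannot reject constant variance
```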
A student does not understand why the sum of squares defined in (3.16) is called a pure error sum of squares, "since the formula looks like one for an ordinary sum of squares." Explain.
Answer:
The equation given in (3.16) is \[ \text{SSE}(F) = \sum_j \sum_i (Y_{ij} - \overline{Y}_{j})^2 = \text{SSPE} \] This is called the pure error sum of squares because it measures the variation of the \(Y_{ij}\) around the mean \(\overline{Y}_j\) of the replicates at each \(X\) level, and this variation reflects only random error, with no contribution from lack of fit. It differs from an ordinary sum of squares because it is derived from the full model for the lack-of-fit test, \[ Y_{ij} = \mu_j + \varepsilon_{ij} \] which places no restrictions on the means \(\mu_j\) (\(\text{E}[Y_{ij}] = \mu_j\)), whereas in the regression model the mean responses are linearly related to \(X\): \(\text{E}[Y] = \beta_0 + \beta_1X\).
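A minimal numerical sketch of SSPE, using made-up data with two \(X\) levels and three replicates each:

```r
# SSPE: squared deviations of each Y_ij from its own level mean Y-bar_j.
x <- c(1, 1, 1, 3, 3, 3)                 # two X levels, three replicates each
y <- c(2.0, 2.4, 2.2, 5.1, 4.9, 5.0)
level_means <- ave(y, x)                 # Y-bar_j, repeated within each level
sspe <- sum((y - level_means)^2)
sspe                                     # 0.08 + 0.02 = 0.1
```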
Refer to Copier maintenance Problem 1.20.
Answer:
Answer:
Answer:
Refer to Plastic hardness Problem 1.22.
Answer:
Answer:
Answer:
Solution concentration. A chemist studied the concentration of a solution (\(Y\)) over time (\(X\)). Fifteen identical solutions were prepared. The \(15\) solutions were randomly divided into five sets of three, and the five sets were measured, respectively, after \(1\), \(3\), \(5\), \(7\) and \(9\) hours.
Answer:
Answer:
Answer:
Refer to Solution concentration Problem 3.15.
Answer:
Answer:
Answer:
Answer:
Answer:
Answer:
Sales growth. A marketing researcher studied annual sales of a product that had been introduced 10 years ago. The data is given as follows, where \(X\) is the year (coded) and \(Y\) is sales in thousands of units.
Answer:
Answer:
Answer:
Answer:
Answer:
Answer:
Production time. In a manufacturing study, the production times for \(111\) recent production runs were obtained. The table lists for each run the production time in hours (\(Y\)) and the production lot size (\(X\)).
Answer:
Answer:
Answer:
Answer:
Answer:
A student fitted a linear regression function for a class assignment. The student plotted the residuals \(e_i\) against \(Y_i\) and found a positive relation. When the residuals were plotted against the fitted values \(\hat{Y}_i\), the student found no relation. How could this difference arise? Which is the more meaningful plot?
Answer:
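A quick simulation (a sketch, not the recorded answer) illustrates the mechanism: least squares residuals are exactly uncorrelated with the fitted values \(\hat{Y}_i\), but since \(Y_i = \hat{Y}_i + e_i\), they are positively correlated with \(Y_i\), with correlation \(\sqrt{1-R^2}\). The plot against \(\hat{Y}_i\) is therefore the more meaningful one.

```r
# Residuals are orthogonal to the fitted values in least squares, but since
# Y = Y-hat + e they are positively correlated with the observed Y.
set.seed(42)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
e <- resid(fit)
cor(e, fitted(fit))   # essentially zero (exact orthogonality up to rounding)
cor(e, y)             # positive; equals sqrt(1 - R^2)
```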
If the error terms in a regression model are independent \(N(0, \sigma^2)\), what can be said about the error terms after transformation \(X' = \frac{1}{X}\) is used? Is the situation the same after transformation \(Y' = \frac{1}{Y}\) is used?
Answer:
Derive the result in (3.29).
Answer:
Using (A.70), (A.41) and (A.42), show that \(\text{E[MSPE]} = \sigma^2\) for normal error regression model (2.1).
Answer:
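A sketch of the key steps (no answer was recorded here), using the fact that a sum of squared deviations about a sample mean of \(n_j\) normal observations has expectation \((n_j - 1)\sigma^2\):

```latex
\mathrm{E}[\mathrm{SSPE}]
  = \sum_{j=1}^{c} \mathrm{E}\!\left[\sum_{i}(Y_{ij}-\overline{Y}_j)^2\right]
  = \sum_{j=1}^{c} (n_j - 1)\,\sigma^2
  = (n - c)\,\sigma^2,
\qquad
\mathrm{E}[\mathrm{MSPE}] = \frac{\mathrm{E}[\mathrm{SSPE}]}{n - c} = \sigma^2
```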
A linear regression model with intercept \(\beta_0=0\) is under consideration. Data have been obtained that contain replications. State the full and reduced models for testing the appropriateness of the regression function under consideration. What are the degrees of freedom associated with the full and reduced models if \(n=20\) and \(c=10\)?
Answer:
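One standard way to set this up, following the lack-of-fit framework of this chapter (a sketch, since no answer was recorded here):

```latex
\text{Full model: } Y_{ij} = \mu_j + \varepsilon_{ij},
\qquad
\text{Reduced model: } Y_{ij} = \beta_1 X_{j} + \varepsilon_{ij}
```

The full model estimates the \(c\) level means \(\mu_j\), so \(df_F = n - c = 20 - 10 = 10\); the reduced model estimates only \(\beta_1\) (there is no intercept), so \(df_R = n - 1 = 20 - 1 = 19\).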
Blood pressure. Data were obtained in a study of the relation between diastolic blood pressure (\(Y\)) and age (\(X\)) for boys 5 to 13 years old.
Answer:
Answer:
Answer:
Refer to the CDI data set in Appendix C.2 and Project 1.43. For each of the three fitted regression models, obtain the residuals and prepare a residual plot against \(X\) and a normal probability plot. Summarize your conclusions. Is linear regression model (2.1) more appropriate in one case than in the others?
Answer:
Refer to the CDI data set in Appendix C.2 and Project 1.44. For each geographic region, obtain the residuals and prepare a residual plot against \(X\) and a normal probability plot. Do the four regions appear to have similar error variances? What other conclusions do you draw from your plots?
Answer:
Refer to the SENIC data set in Appendix C.1 and Project 1.45.
Answer:
Answer:
Refer to the SENIC data set in Appendix C.1 and Project 1.46. For each geographic region, obtain the residuals and prepare a residual plot against \(X\) and a normal probability plot. Do the four regions appear to have similar error variance? What other conclusions do you draw from your plots?
Answer:
Refer to Copier maintenance Problem 1.20.
Answer:
Answer:
Answer:
Refer to Sales growth Problem 3.17.
Answer:
Answer:
Answer:
Refer to the Real estate sales data set in Appendix C.7. Obtain a random sample of \(200\) cases from the \(522\) cases in this data set. Using the random sample, build a regression model to predict sales price (\(Y\)) as a function of finished square feet (\(X\)). The analysis should include an assessment of the degree to which the key regression assumptions are satisfied. If the regression assumptions are not met, include and justify appropriate remedial measures. Use the final model to predict sales price for two houses that are about to come on the market: the first has \(X=1100\) finished square feet and the second has \(X=4900\) finished square feet. Assess the strengths and weaknesses of the final model.
Answer:
Refer to the Prostate cancer data set in Appendix C.5. Build a regression model to predict PSA level (\(Y\)) as a function of cancer volume (\(X\)). The analysis should include an assessment of the degree to which the key regression assumptions are satisfied. If the regression assumptions are not met, include and justify appropriate remedial measures. Use the final model to estimate mean PSA level for a patient whose cancer volume is \(20\) cc. Assess the strengths and weaknesses of the final model.
Answer: